# Cross-modal Reasoning

Ristretto 3B
Apache-2.0
Ristretto is a vision-language model that uses dynamic image token allocation, adjusting the number of image tokens to the requirements of each task, and is described as surpassing its previous generations in performance and versatility.
Image-to-Text Transformers Supports Multiple Languages
LiAutoAD
732
2
Chattime 1 7B Chat
Apache-2.0
ChatTime is a multimodal foundation model that unifies time series and text processing, featuring zero-shot forecasting capabilities and supporting dual-modal input/output for both time series and text.
Multimodal Fusion Transformers
ChengsenWang
1,621
2
Chemvlm 26B
MIT
ChemVLM is a multimodal large language model for chemistry applications that combines text and image processing capabilities.
Image-to-Text Transformers
AI4Chem
53
21
Chameleon 7b
Other
Meta Chameleon is a mixed-modal early-fusion foundation model developed by FAIR that supports multimodal processing of images and text.
Multimodal Fusion Transformers
facebook
20.97k
179
Cogvlm Chat Hf
Apache-2.0
CogVLM is a powerful open-source vision-language model that achieves leading performance on multiple cross-modal benchmarks.
Text-to-Image Transformers English
THUDM
4,816
193
Pix2struct Infographics Vqa Large
Apache-2.0
Pix2Struct is an image encoder-text decoder model trained through multi-task learning for visual-language understanding tasks, specifically optimized for visual question answering on high-resolution infographics.
Image-to-Text Transformers Supports Multiple Languages
google
108
10